Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use more appropriate types for cross-platform unicode support #19

Merged
merged 7 commits into from
Jan 17, 2020

Conversation

vjoki
Copy link
Contributor

@vjoki vjoki commented Dec 18, 2019

Interally things are switched over to using widestring and W/Ex variants of the unrar_sys functions, as these seem to be preferred (unrar internally converts non-wchar strings). This is also required for unicode support on Windows (UTF-16), but also works on other platforms (by using UTF-32).

Breaking changes to public API:

  • More permissive with the types accepted for archive filename,
    destination and password. (TBH could be more permissive if I didn't (pointlessly) try to avoid ownership inside Archive.)
  • Return PathBufs rather than Strings when representing filenames.

One thing still missing is that passwords are still Strings, as the old RARSetPassword interface only
supports chars. It's going to necessary to add support for UCM_NEEDPASSWORDW first, which requires some more changes to the callback.

list_missing_volumes fails on Windows, because it expects '/' as the path
separator.
Switch over to widestring and W/Ex variants of the unrar_sys functions
everywhere, fixing unicode issues at the cost of using u16/u32
instead of u8 types for strings. Though these are what the underlying
unrar library is also using.

Changed the public Archive API:
- More permissive with the types accepted for archive filename,
  destination and password.
- Return PathBufs rather than Strings when representing filenames.

Also tried to avoid taking ownership of filename inside Archive. But
despite the efforts, all the input strings/paths still end up getting
copied in the conversion to (Wide)CString.

Passwords are still Strings, as the old RARSetPassword interface only
supports char. It will be necessary to switch to using
UCM_NEEDPASSWORDW first. Apparently this is also required for archives
with encrypted headers.

Fixes muja#11, and partially addresses muja#4.
@muja
Copy link
Owner

muja commented Dec 19, 2019

Thank you very much for this PR! I believe this should also fix #4.

Some things to note:

  • This is a breaking change, so I should release a patch version before including this in a minor version release.
  • Why are you avoiding ownership inside Archive?

#[test]
fn unicode_list() {
let mut entries = Archive::new("data/unicode.rar").list().unwrap();
assert_eq!(entries.next().unwrap().unwrap().filename, PathBuf::from("te…―st✌"));
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this string non-utf8? How would this work? Aren't &strs UTF-8? Wouldn't you need to use a Vec<u8>?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's UTF-8, I haven't yet come up with any good way of crafting a suitable UTF-16 test case.

@vjoki
Copy link
Contributor Author

vjoki commented Dec 20, 2019

The reason why I opted to borrowing over owning in Archive was because I didn't see much value in taking owned values, as they all get copied in the conversion to WideCString and dropped after. Also taking ownership would result in unnecessary (but probably cheap) cloning of borrowed types, while the tradeoff seemed more harmless.

To go in more detail, the options I considered were:

  • Using values (String, PathBuf), but it forces the user to clone borrowed and do manual conversions so I ruled it out.
  • Using AsRef<Path> + ?Sized + 'a stored as Cow<'a, Path> for filename, which is the most generic interface without unecessary cloning that I could come up with. It has the caveat of owned types needing to be passed as references. Accepting at least str, Path, OsStr, &PathBuf, &String, &OsString.
  • Using Into<PathBuf>, which would probably be the most invisible to the user, but results in unnecessary copying when passing borrowed types (str, Path, OsStr). Accepting at least str, Path, OsStr, PathBuf, String, OsString plus references.

password could also be &'a str (almost equivalent to the AsRef?) or Into<String>, with essentially the same considerations as with filename. There's still minor differences right now as String/CString is used, but that changes when UCM_NEEDPASSWORDW gets added (probably to AsRef<OsStr> or Into<OsString> converted to WideCString).

For destination path in Archive::extract_to, AsRef<Path> seems like the best option as it ought to be as generic as it gets (no extra constraints like what filename has).

So the decisions here boil down to whether to accept the cost of extra copy or the need to pass owned by reference. Personally, I still prefer the latter but the final call is yours.

@muja
Copy link
Owner

muja commented Dec 21, 2019

I think it's okay to get this merged for now. However, I'd like to wait before doing a 0.5 release as I've been planning to restructure the library for a while now but have never gotten around to it.

I'm not sure if you're aware, but I've talked about it in PR #10, too. My idea is to give the user more control over what to do with each entry by separating the read_header call and the process from one another. I'm trying to use a streaming iterator to iterate over the archive's entries.

Maybe you have some thoughts about that, too. I opened an issue: #20

@muja
Copy link
Owner

muja commented Dec 21, 2019

However, would it be possible to not panic if CString contains nul characters?

@muja
Copy link
Owner

muja commented Dec 21, 2019

Also (sorry for spamming), would you like to be included in the authors file? :) I think you more than deserve to be mentioned.

@vjoki
Copy link
Contributor Author

vjoki commented Dec 21, 2019

I can have it not panic, but I'm not quite sure what to do instead. Maybe return with UnrarError(Code::Unknown, ...) for now, until UnrarError gets refactored?

Also inclusion in the authors file is fine with me.

@muja
Copy link
Owner

muja commented Jan 10, 2020

I see what you're getting at, with the Code::Unknown and #22. A compromise in the meantime would be, before deciding on an error framework as a long term solution, one of these two options:

  1. convert the password into a CString in the with_password function and return a Result there
  2. same as 1) but panic instead of returning a Result (should be documented though).
  3. split the password at NUL char and take the first part

Options 1) or 2) seem like good solutions and are more honest with the caller. 3) seems like black magic and would possibly contain some element of surprise.

The same I would do for the filename (this would be in the constructor of Archive).
As for destination in extract_to, option 1) would not be ergonomic as it already returns an UnrarResult. This is the reason why I'm personally inclined towards 2).

We could also 1) Archive::with_password and Archive::new and 2) Archive::extract_to and add another function Archive::extract_to_checked which adheres to 1)... What do you think?

Sorry for bringing this up only after you refactored your PR to remove the panics.

@vjoki
Copy link
Contributor Author

vjoki commented Jan 11, 2020

No problem.

Doing 1) for password seems fine, but filename can get modified by as_first_part(). So the earliest point filename can be converted is in open().

I suppose it should be 1) for password, and 2) for the rest?

@muja
Copy link
Owner

muja commented Jan 12, 2020

Well if it really only panics if the String contains a nul character, we could just look for one and not wait for the open call to convert it? I'm not sure if I like inconsistent treatments of the same kind of error

@vjoki
Copy link
Contributor Author

vjoki commented Jan 12, 2020

I'd be fine with that, but it would leave some room for mistakes (as in nul getting inserted in filename somewhere in Archive). Which means you'll want to check for nul twice then?

There is from_os_str_unchecked, if you dare. :-)

- Add NulError, so that we can combine ffi::NulError and widestring::NulError.
- Check filename and password for nul values on Archive creation.
@vjoki
Copy link
Contributor Author

vjoki commented Jan 17, 2020

Well, I went ahead and gave it a go, but I think it might be better to just revert the last 2 commits and panic on all cases for now. Because doing the conversions in OpenArchive::new and failing there seems to be the cleanest option, but it's your call.

@muja
Copy link
Owner

muja commented Jan 17, 2020

I think for now this looks fine, especially since we explicitly document where the function might panic. #22 should be implemented though in the future to be able to handle these higher level errors gracefully. Thank you for your time! :)

@muja muja merged commit 3af9a60 into muja:master Jan 17, 2020
@muja
Copy link
Owner

muja commented Jan 17, 2020

@vjoki unrelated to this but do you prefer your full name or your alias for the authors list?

@vjoki
Copy link
Contributor Author

vjoki commented Jan 17, 2020

Alias would be fine.

@vjoki vjoki deleted the widestring branch January 24, 2020 23:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants